Using information on the tip states of a phylogenetic tree we can use evolutionary models to estimate the ancestral states of internal nodes and transitions between states.
We fit an \(Mk\) model of evolution to our tree and habitat preference data to understand how habitat preference evolved. For example does evolution readily allow transitions between specialist habitat preferences (i.e. from marine mud only to terrestrial only).
We also try and fit an \(Mk\) model of evolution when we combine two traits together (habitat preference and whether or not the species has high or low diversification rate as identified by BAMM).
Ancestral state reconstructions can also be estimated from the MuSSE and HiSSE models we are aiming to fit later on. These models would have a lot of free parameters and we are unsure whether they can be fit with the dataset we have. We hope to reduce the number of free parameters in those models by estimating the transition matrix which best fits the data in the \(Mk\) model. This would reduce the number of free parameters.
2 TL;DR
Go to Section 12 for the best \(Mk\) model and a discussion of the transition matrix.
3 Progress and results
Done ancestral state reconstructions using a variety of methods:
ape::ace()
castor::asr_mk_model()
diversitree::fit_mk()
phytools::make.simmap()
Have written a quick primer to BayesTraits but have not used it on our tree due to the time it takes to run the analysis with a transition matrix where all states are different.
Have compared the three standard models of evolution for the transition matrix (equal rates, symmetric rates, and all rates different) using ape::ace() and the all-rates different method is best. Then did model simplification to find the best model for the transition matrix.
Have plotted the ancestral state reconstructions of a subset of the tree. For the whole tree my computer gave up.
Have plotted the transition matrix in multiple ways, see ?@sec-best-model
4 Fitting transition rate models and doing ancestral state reconstructions
We can do ancestral state reconstruction with a couple of different methods in R as well as trying in BayesTraits.
For discrete traits, the most commonly used model for evolution on tree is called the \(Mk\) model. \(M\) stands for Markov because the modelled process is a continuous-time Markov chain, and \(k\) because the model is generalised to include an arbitrary number (\(k\)) states.
The central attribute of the \(Mk\) model is a the transition matrix, \(Q\), which gives the instantaneous transition rates between states.
An important distinction in ancestral state reconstruction for discrete characters is joint vs. marginal reconstruction. Joint reconstruction is finding the set of character states at all nodes that maximise the likelihood. Marginal reconstruction is finding the state at the current node that maximises the likelihood integrating over all the other states at all nodes, in proportion to their probability.
We can do both marginal reconstruction using ape::ace() and castor::asr_mk_model() and joint reconstruction using ape::ace().
An alternative tactic to the one outlined above is to use an MCMC approach to sample character histories from their posterior probability distribution. This is called stochastic character mapping. The model is the same but in this case we get a sample of unambiguous histories for our discrete character’s evolution on the tree - rather than a probability distribution for the character at nodes.
Finally we can use BayesTraits to do both maximum likelihood and MCMC approaches. We do not use this approach in this walk-through.
The evolutionary models that are used to reconstruct ancestral states have a transition matrix between discrete states that define the rates of transitions between states on the tree. In most of these models, there are three that are easy to define: ARD means all transition rates are different, SYM means transitions to and from the same states are the same, and ER is an equal rates model where all rates between states are the same. These transition matrices can also be custom made but for now these three seem sensible starting points. Transition matrices also underpin the MuSSE and BiSSE models we are hoping to fit.
5 Load in R packages
First we will load in R packages used and the metadata file used and wrangled in a previous walk-through.
We also load in the final phylogenetic tree, the colours used for habitat preferences, and the position of shift nodes.
Code
# load packageslibrary(here)library(tidyverse)library(ggtree)library(ggnewscale)library(RColorBrewer)library(patchwork)library(ape)library(phytools)library(BAMMtools)library(coda)library(MetBrewer)library(ggpp)library(castor)library(diversitree)library(tidygraph)library(igraph)library(ggraph)library(GGally)library(ggrepel)# set where I am in the projecthere::i_am('scripts/sequencing_rpoB/analyses/estimating_transition_rates.qmd')# read in metadatad_meta <-read.csv(here('data/sequencing_rpoB/processed/asv_metadata.csv'))# read in habitat trait colourscols_hab <-readRDS(here('data/sequencing_rpoB/processed/habitat_colours.rds'))# read in treetree <-read.tree(here('data/sequencing_rpoB/bamm/rerooted-pruned-chronopl10.tre'))# edit tree labelsd_labels <-data.frame(tip_label = tree$tip.label) %>%separate(., tip_label, c('part1', 'part2', 'part3'), sep ='_', remove =FALSE) %>%unite('tip_label_new', c(part1, part2), sep ='_')tree$tip.label <- d_labels$tip_label_new# read in shift nodesshiftnodes <-readRDS(here('data/sequencing_rpoB/processed/shiftnodes.rds'))
6 Custom function
We will write a custom function for getting the transition matrix into a format that is easy to plot using ggplot2.
Code
# get dataframe of transition ratesace_transition_df <-function(ace_object){ temp_1 <- ace_object$index.matrixcolnames(temp_1) <-rownames(temp_1) <-colnames(ace_object$lik.anc)# turn into dataframe temp_1 <-as_tibble(temp_1, rownames ='state_1') %>%# make long format, the column names become state 2pivot_longer(freshwater:terrestrial, names_to ='state_2', values_to ='order') %>%mutate(free_param =ifelse(is.na(order)|order==0, 'no', 'yes'),num_params =length(ace_object$rates)) temp_1$transition_rate <-NA temp_1$transition_rate[temp_1$order ==0] =0# run for loop to input estimated transition ratesfor(i in1:length(ace_object$rates)){ temp_1[temp_1$order %in% i,]$transition_rate <- ace_object$rates[i] }return(select(temp_1, -order, state_1, state_2, transition_rate, free_param, num_params))}
7 Create a master dataframe for the trait states
Different methods for estimating ancestral states take different values for the traits. Some only allow for numeric values, whereas others allow characters. Some do not allow colons in the name, and one (MBASR which uses Mr Bayes) needs the first trait value to be 0. To allow for us to semi-easily link across different methods, we will make a data frame of our unique tip states and then turn them into a numeric vector. The ordering is all done alphabetically.
Code
d_meta <-tibble(tip_label = tree$tip.label) %>%left_join(., d_meta)# make sure order of habitat preference is the same in the trait vector as the tip labelssum(d_meta$tip_label == tree$tip.label) ==length(tree$tip.label)
[1] TRUE
Code
# SUCCESS if TRUE# replace NA of outgroup - make it freshwater:terrestrial - the most common state# phytools::make.simmap cannot take NAshab_pref <-setNames(d_meta$habitat_preference, d_meta$tip_label)hab_pref[is.na(hab_pref)] <-'freshwater:terrestrial'# make habitat preference vector numerichab_pref_num <-as.numeric(as.factor(hab_pref)) -1hab_pref_num <-setNames(hab_pref_num, d_meta$tip_label)hab_pref_num2 <- hab_pref_num +1# coding from numeric to charactercoding <-tibble(hab_pref =unname(hab_pref), hab_pref_num =unname(hab_pref_num)) %>%distinct() %>%mutate(hab_pref2 =gsub(':', '.', hab_pref),hab_pref_num2 = hab_pref_num +1) %>%arrange(hab_pref) %>%mutate(hab_pref_axis =gsub(':', '/ ', hab_pref),hab_pref_axis =gsub('_', ' ', hab_pref_axis),# rename the columns for easy naminginitials =c('F', 'FM', 'FT', 'G', 'M', 'MT', 'T'))coding
# A tibble: 7 × 6
hab_pref hab_pref_num hab_pref2 hab_p…¹ hab_p…² initi…³
<chr> <dbl> <chr> <dbl> <chr> <chr>
1 freshwater 0 freshwater 1 freshw… F
2 freshwater:mud_and_shore 1 freshwater.mud… 2 freshw… FM
3 freshwater:terrestrial 2 freshwater.ter… 3 freshw… FT
4 generalist 3 generalist 4 genera… G
5 mud_and_shore 4 mud_and_shore 5 mud an… M
6 mud_and_shore:terrestrial 5 mud_and_shore.… 6 mud an… MT
7 terrestrial 6 terrestrial 7 terres… T
# … with abbreviated variable names ¹hab_pref_num2, ²hab_pref_axis, ³initials
Code
# save out habitat preference vectorsaveRDS(hab_pref, here("data/sequencing_rpoB/processed/hab_pref_vec.rds"))
This results in a dataframe which allows us to easily link between numeric and character values, and also gives us a vector - of the same length of tips in our tree - which we can use for estimating ancestral states.
8 Estimate ancestral states using ape::ace()
First lets fit each of the models (ER, SYM, and ARD) using ape::ace(). We can compare these models using AIC and likelihood ratio tests to see which model fits the tree best.
Code
# do ancestral reconstruction with apeancestral_states_ape <-ace(hab_pref, tree, model="ER", type="discrete")ancestral_states_ape1 <-ace(hab_pref, tree, model="SYM", type="discrete")ancestral_states_ape2 <-ace(hab_pref, tree, model="ARD", type="discrete")
We can do model comparison using both AIC scores and likelihood ratio tests.
# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2)))mod_compare
# A tibble: 3 × 2
mod aic
<chr> <dbl>
1 ER 10550.
2 SYM 8582.
3 ARD 8461.
From these comparisons we can see that the ARD model, where all rates in the transition rate are allowed to be different, is the best model in terms of both doing likelihood ratio tests of nested models and AIC scores.
8.1 Do some rudimentary checks of ape::ace()
We can grab the ancestral states from the output of ace(). We can do some checks which include checking whether any of the transition rates are negative or whether they have any standard errors that are NaN, and whether any of the node states have probabilities > 1. This hints the model is too complex/has not converged/not doing a good job.
We can format the transition matrix from the output of ace::ape() to look at the rates. As it is not stated in the ape::ace() directory, we assume the rates are entered as in castor::asr_mk_model(), where the [r,c]-th entry is the transition rate from state r to state c. This is confirmed on this help page.
Code
# see whether any of the SEs are NA. This hints that its not converged properlyancestral_states_ape$se
[1] 0.007366617
Code
ancestral_states_ape1$se
[1] 0.03892461 0.14981260 0.01579165 0.03222234 0.01275812 0.09737104
[7] 0.04687340 NaN 0.01887849 NaN 0.04728073 0.02064626
[13] 0.04107843 0.02381328 0.67620785 NaN NaN 0.01832579
[19] 0.01817874 0.03706546 0.02527278
# as you can see both the symmetric and all rates different models have NaNs for some/all of the transitions# get marginal likelihood of each ancestral state at each noded_acr_ape <- ancestral_states_ape2$lik.anc %>%as_tibble(rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_ape', names_to ='habitat_preference')d_acr_ape_summary <-group_by(d_acr_ape, node) %>%summarise(total =sum(probability_ape), .groups ='drop') %>%filter(total !=1)nrow(d_acr_ape_summary)
[1] 7
Code
# apparently 7 nodes that do not equal 1, but they are very close to 1.
Some of the standard errors for the ARD model are NaN, suggesting convergence problems. However, there are only seven internal nodes with probabilities different from 1, so these estimates are not completely crazy.
8.2 Plotting transition matrices
We can plot the transition matrices to see what the estimated rates are. This uses our custom function ace_transition_df().
The transition matrices plotted here can be interpreted as follows:
the transition rates are from state 1 (on the y axis) to state 2 (on the x axis)
diagonal rates are when state 1 equals state 2 and are therefore not estimated
Red boxes indicate rates which are estimated. This will be useful when we start using custom matrices.
Code
# bind together the transition matrices from each of the standard models for the transition matricesd_models <-bind_rows(ace_transition_df(ancestral_states_ape) %>%mutate(model ='equal rates'),ace_transition_df(ancestral_states_ape1) %>%mutate(model ='symmetric'),ace_transition_df(ancestral_states_ape2) %>%mutate(model ='all rates different')) %>%# add in colums for reordering and naming of axesleft_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2))# plot transition matrices# make heatmap of changesggplot(d_models, aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate), col ='black', width =0.9, height =0.9) +geom_tile(aes(alpha = transition_rate), fill =NA, col ='red', size =1.1, filter(d_models, free_param =='yes'), width =0.9, height =0.9) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5)) +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2') +coord_fixed() +facet_wrap(~model)
8.3 Doing model simplification on the all rates different model
So we have seen that the all rates different model is better than the SYM model, but we can also use the results of the ARD model to make some of the rates 0.
We can view the all rates different transition matrix only.
The transitions where the rates are estimated to be 0 are:
freshwater to marine mud + terrestrial
freshwater + terrestrial to marine mud
generalist to marine mud
marine mud to freshwater + terrestrial
marine mud to generalist
terrestrial to freshwater + marine mud
terrestrial to generalist
We can create a custom matrix and fix these parameters to be 0.
Code
num_states <-unique(hab_pref) %>%length()# we will set up the custom matrix, this has 49 numbers and then we have to set the correct numbers to 0custom_matrix <-matrix(1:num_states^2, nrow=num_states)# make all diagonal numbers NAfor(i in1:nrow(custom_matrix)){ custom_matrix[i,i] <-0}# make specific transitions 0 based on the output of the ARD.custom_matrix[custom_matrix %in%c(14, 19, 26, 28, 31, 32, 36)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix[custom_matrix !=0] <-1:length(custom_matrix[custom_matrix !=0])custom_matrix
We can now try and refit the \(Mk\) model to the tree and habitat preference data, but using our custom transition matrix.
Code
# lets try refit the model with these parameters fixed to zeroancestral_states_ape3 <-ace(hab_pref, tree, model=custom_matrix, type="discrete")
We can now do model simplification, making sure we compare the simpler model (with fewer free parameters) to the more complex model (the all rates different model). We can also compute the AIC and add it to our table.
Code
# compare the less complicated model with the more complicated modelanova(ancestral_states_ape3, ancestral_states_ape2)
Likelihood Ratio Test Table
Log lik. Df Df change Resid. Dev Pr(>|Chi|)
1 -4170.0 35
2 -4188.3 42 7 -36.628 1
Code
# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3)))arrange(mod_compare, aic)
# A tibble: 4 × 2
mod aic
<chr> <dbl>
1 ancestral_states_ape3 8410.
2 ARD 8461.
3 SYM 8582.
4 ER 10550.
So the model where we explicit remove 0s from the model fitting process fits the model loads better (lower AIC, anova p value = 1 indicating the extra parameters - the ARD model - do not help explain the data any better).
Lets now look at the transition matrix from the most recent model to see what can be removed.
The black boxes with 0 in are rates that we fixed to be 0. There are now no other rates that are 0. We will remove the next lowest rate to see whether its presence significantly improves the model fit. This is freshwater + terrestrial to marine mud + terrestrial.
# input the rates into the custom matrixcustom_matrix2 <- custom_matrix# set some more rates to 0custom_matrix2[custom_matrix2 %in%c(15,20,25,26,31,33)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix2[custom_matrix2 !=0] <-1:length(custom_matrix2[custom_matrix2 !=0])
We can now re run the \(Mk\) model.
Code
# lets try refit the model with these parameters fixed to zeroancestral_states_ape4 <-ace(hab_pref, tree, model=custom_matrix2, type="discrete")
We can do model selection as before, but now this new model (fewer parameters) is compared to the previous custom matrix model (more parameters).
Code
# compare the less complicated model with the more complicated modelanova(ancestral_states_ape4, ancestral_states_ape3)
These zeroes somehow make the model loads worse, so our best model is apparently ancestral_states_ape3. When plotting this model we can see it is doing an extremely bad job.
Instead we can try and refit ancestral_states_ape3 with values <0.1 set to 0. This will hopefully allow us to move past this problem.
Code
# create new custom matrixcustom_matrix3 <- custom_matrix# get index of rates which are <0.1which(ancestral_states_ape3$rates <0.1)
[1] 4 6 15 17 20 24 25 26 30 31 33 34
Code
custom_matrix3[custom_matrix3 %in%which(ancestral_states_ape3$rates <0.1)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix3[custom_matrix3 !=0] <-1:length(custom_matrix3[custom_matrix3 !=0])# fit habitat preference modelancestral_states_ape5 <-ace(hab_pref, tree, model=custom_matrix3, type="discrete")anova(ancestral_states_ape5, ancestral_states_ape3)
Likelihood Ratio Test Table
Log lik. Df Df change Resid. Dev Pr(>|Chi|)
1 -4171.5 23
2 -4170.0 35 12 2.9478 0.9959
Code
# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5)))arrange(mod_compare, aic)
# A tibble: 6 × 2
mod aic
<chr> <dbl>
1 ancestral_states_ape5 8389.
2 ancestral_states_ape3 8410.
3 ARD 8461.
4 SYM 8582.
5 ER 10550.
6 ancestral_states_ape4 11940.
The values for ape::ace() are very similar to those from castor::asr_mk_model(). This is good. There is one shift that is estimated to be 0 using castor, but it is > 2 for ape. This is the transition
We can see if this result is the same for diversitree, and hints this parameter could be set to 0. However right now we will not do that.
10 Estimate ancestral states with diversitree
We can also do the equivalent analysis using diversitree. This might be beneficial because it can also be used for BiSSE models later on.
It is a bit more involved in terms of defining the constraints on the transition matrix.
Code
# all of these methods needs a likelihood function, we can build a Mkn modellik <-make.mkn(tree, hab_pref_num2, k =max(hab_pref_num2))# constrain correct parameters to be 0lik_constrain <-constrain(lik, q14~0, q16~0, q17~0, q26~0, q27~0, q35~0, q36~0, q45~0, q47~0, q51~0, q52~0, q53~0, q54~0, q57~0, q63~0, q64~0, q71~0, q72~0, q74~0, q75~0)argnames(lik_constrain)
# need to pass start values to it - can grab these from the ape::ace, but we will just pass an average rate to the modelinits <-rep(mean(ancestral_states_ape6$rates), length(argnames(lik_constrain)))# find the maximum likelihood estimates of this modelmod_ml <-find.mle(lik_constrain, inits)# pass inits from the ancestral states model#inits2 <- ancestral_states_ape6$rates#mod_mcmc <- mcmc(lik_constrain, inits2, nsteps = 10, w = 1)unname(mod_ml$par)
# save out diversitree modelsaveRDS(mod_ml, here("data/sequencing_rpoB/processed/diversitree_mk_model_mac.rds"))
We can compare the rates estimated from ape::ace() and diversitree::find.mle() by putting putting the output of diversitree into a dataframe and combining it with our current comparison dataframe.
Code
# grab names of q matrices from diversitreeq_matrix_diversitree <-names(mod_ml$par)# create dataframe with theseq_matrix <-tibble(q_matrix_diversitree = q_matrix_diversitree) %>%# grab first and second elements that are the habitat codingsmutate(state_1_num =as.numeric(substr(q_matrix_diversitree, 2,2)),state_2_num =as.numeric(substr(q_matrix_diversitree, 3,3)),transition_rate_diversitree =unlist(mod_ml$par)) %>%left_join(select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2)) %>%left_join(select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2)) %>%select(-state_1_num, -state_2_num, -q_matrix_diversitree)# combine into comparison dataframed_transition_compare <-left_join(d_transition_compare, q_matrix) %>%select(state_1, state_2, transition_rate, transition_rate_castor, transition_rate_diversitree) %>%mutate(across(starts_with('transition'), round, 2))# plot to see similarityggplot(d_transition_compare, aes(transition_rate, transition_rate_diversitree)) +geom_abline(aes(intercept =0, slope =1)) +geom_point(size =2) +theme_bw()
We can see the estimates are equivalent! This is great news! However the model convergence code is also -1 which hints that the model has not converged. So it is complicated.
11 Estimate ancestral states using stochastic character mapping using phytools::make.simmap()
An alternative approach is to do stochastic mapping of the ancestral states. What this does is it calculates the conditional likelihood for each character state at each node of the tree, including the root, and then simulate ancestral states at each internal node by sampling from the posterior distribution of states.
We need to do some initial changes to our transition matrix which we can pass to make_simmap() so the \(M_{k}\) model is not fitted by this model. Importantly we need to make sure the rows add up to 0, so we need to make the diagonal value of the transition matrix be the negative of the sum of each row.
Code
# rename best transition matrix for using it laterbest_matrix <- custom_matrix4rownames(best_matrix) <-colnames(best_matrix) <-sort(coding$hab_pref)# get dataframe of transition matrixd_best_matrix <-ace_transition_df(ancestral_states_ape6) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) # get best matrixbest_matrix_estimates <-select(d_best_matrix, state_1, state_2, transition_rate) %>%spread(state_2, transition_rate) %>%column_to_rownames(var ='state_1') %>%mutate(across(everything(), replace_na, 0)) %>%as.matrix()# make diagonal values make things sum to 0diag(best_matrix_estimates) <--rowSums(best_matrix_estimates)
We can now run make.simmap() on our tree. We will run it for 100 simulations for now to reduce object space.
Code
# do stochastic mapping of character traits using phytools# feed in the best transition matrix previously found using other methodsancestral_states_phytools <-make.simmap(tree, hab_pref, nsim =100, Q = best_matrix_estimates)
It has finished! We can count the number of transitions between each state using describe.simmap(). We can then plot these to see the distribution of numbers of transitions.
Code
# summarise number of switches between states and time spent in each statesimmap_summary <-describe.simmap(ancestral_states_phytools, plot=FALSE)# coerce transitions into dataframed_transitions <-as.data.frame(simmap_summary$count, col.names =colnames(simmap_summary$count)) %>%pivot_longer(cols =c(everything(), -N), names_to ='transition', values_to ='n_transitions') %>%separate(transition, c('state_1', 'state_2'), sep =',', remove =FALSE) %>%mutate(label =gsub(',', ' -> ', transition))# plot thesegroup_by(d_transitions, transition) %>%filter(max(n_transitions) >0) %>%mutate(average =mean(n_transitions)) %>%ungroup() %>%mutate(label =fct_reorder(label, desc(average))) %>%ggplot() +geom_histogram(aes(n_transitions), fill ='white', col ='black', binwidth =function(x) (max(x)-min(x))/nclass.FD(x)) +facet_wrap(~ label, scales ='free_x') +theme_bw() +labs(title ='Distribution of number of transitions between states',subtitle ='Facets are ordered by common transitions',x ='Number of transitions',y ='Count')
We can see the number of transitions between states is dominated by the terrestrial and freshwater environments, with relatively few transitions involving the marine mud environment.
We can also look at how was spent in each state in each character map iteration.
Code
# coerce time into dataframed_time <-as.data.frame(simmap_summary$times, col.names =colnames(simmap_summary$times)) %>%mutate(n =1:n()) %>%select(-total) %>%pivot_longer(cols =c(everything(), -n), names_to ='state', values_to ='time') %>%group_by(n) %>%mutate(prop = time/sum(time)) %>%ungroup()# plot thesegroup_by(d_time, state) %>%mutate(average =mean(prop)) %>%ungroup() %>%mutate(state =fct_reorder(state, desc(average))) %>%ggplot() +geom_histogram(aes(prop), fill ='white', col ='black', binwidth =function(x) (max(x)-min(x))/nclass.FD(x)) +facet_wrap(~ state, scales ='free_x') +theme_bw() +labs(title ='Proportion of time spent in each state',subtitle ='Facets are ordered in terms of time spent',x ='Proportion of time spent in state',y ='Count')
From this plot, we can see that the tree is dominated by four states: freshwater+terrestrial, terrestrial, marine mud, and freshwater. However, what is interesting is that even though marine mud is the third most common state, it is involved in very relatively few transitions. Transitions out of being a marine mud specialist only occur into freshwater+marine mud, and this is the second least common transition.
We can also compare the ancestral state estimates garnered from the best ape::ace() model and those from the average stochastic character map. These are likely to be extremely similar given we fixed the Q matrix in our stochastic character mapping process to save time.
Code
# grab ancestral state from simmapd_acr_phytools <-as.data.frame(simmap_summary$ace, col.names =colnames(simmap_summary$ace), row.names =row.names(simmap_summary$ace)) %>%rownames_to_column(., var ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_phytools', names_to ='habitat_preference')# grab out marginal likelihood for each node from ape::ace()d_acr_ape <- ancestral_states_ape6$lik.anc %>%as_tibble(rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_ape', names_to ='habitat_preference')# grab out marginal likelihood for each node from castor::asr_mk_model()d_acr_castor <- ancestral_state_castor$ancestral_likelihoodscolnames(d_acr_castor) <-sort(coding$hab_pref)rownames(d_acr_castor) <-rownames(ancestral_states_ape6$lik.anc)d_acr_castor <-as_tibble(d_acr_castor, rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_castor', names_to ='habitat_preference')# grab out marginal likelihood for each node from diversitree::find.mle()d_acr_diversitree <-asr.marginal(lik_constrain, unname(mod_ml$par)) %>%t() %>%as.data.frame(., col.names =colnames(ancestral_states_ape6$lik.anc), row.names =rownames(ancestral_states_ape6$lik.anc)) colnames(d_acr_diversitree) <-colnames(ancestral_states_ape6$lik.anc)d_acr_diversitree <- d_acr_diversitree %>%rownames_to_column(var ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_diversitree', names_to ='habitat_preference')# look at similarity between ace, castor, and phytools approachesd_acr_summary <-reduce(list(d_acr_ape, d_acr_phytools, d_acr_castor, d_acr_diversitree), left_join)# can plot all of these reconstructions to see if they are doing similar things# will use ggpairsggpairs(d_acr_summary, columns =c('probability_diversitree', 'probability_ape', 'probability_castor', 'probability_phytools'),mapping = ggplot2::aes(colour = habitat_preference),lower =list(continuous =wrap("smooth", se =FALSE, alpha =0.3, size=0.1))) +theme_bw(base_size =14)
Stochastic character mapping and the maximum likelihood approach estimate ancestral states in slightly different ways. In ape::ace(), ancestral states are inferred using marginal state reconstruction, whereas the estimates from the stochastic character map use the joint reconstruction (I think).
However, if you force the Q matrix to be maximum likelihood estimates then the estimates for the ancestral states should converge. This is what we see happening here. The diversitree estimates based on the marginal state reconstruction are equivalent to phytools. Something strange is going on with the castor estimates that I cannot work out, but three of our methods are extremely close with each other. This is good!
12 Looking at the best model for ancestral state reconstruction.
We can plot the best transition matrix using ggplot2.
From this transition matrix, we can see some really cool patterns.
The best model has 17 free parameters, down from 42 in the all rates different model.
It is much more common to transition to being more specialist than being more generalist.
generalist -> freshwater, generalist -> freshwater/terrestrial, generalist -> margine/terrestrial have high rates
freshwater/marine -> freshwater has high rate
marine/terrestrial -> marine has high rate
Only freshwater/terrestrial has a transition rate INTO generalist
Zero rates generally map to transitions across habitats (i.e. freshwater -> terrestrial).
This is only non-zero in 1 out of 12 cases.
We can have a look at these habitat preferences and their transition rates in a network, to visualise key stepping stones between states and highly connected trait states.
We will do this using ggraph. I have not implemented this yet.
First need to change our transition matrix into a network.
This plot is AMAZING and has loads of useful biological interpretation.
Marine mud is a common state (larger point), but has very low transition rates away from this state (and can only move into freshwater + marine mud)
Generalist is a rare state and has high rates transitioning away from it.
Rates going from a more polymorphic state to a simpler, more specialised state are generally higher than rates going from a monomorphic to polymorphic state.
Freshwater and its associated polymorphic traits (e.g. freshwater + terrestrial, freshwater + marine mud) are important states that facilitate a lot of transitions between other states.
Only 1 out of 6 transitions from monomorphic states to each other is possible in the best model (freshwater -> marine mud).
13 Plot ancestral state reconstructions using ggtree
We can plot the ancestral states using methods from ggtree. For this we need seven colours, one for each state. This is because it is really important to know whether we are confident the ancestral state reconstruction thinks we are in a polymorphic trait, or whether it is stuck between two (or three), monomorphic states.
The idea is that habitat transitions may result in a change in rate of speciation/diversification, so we can plot the nodes where transitions are most likely to occur over the tree to see if they align with any habitat transitions.
We tried this on the whole tree, but it broke my Mac so instead we are doing it on node 4416, which was identified as a parent of nodes which contained a high density of potential transitions.
Code
# grab out marginal likelihood for each node# will use results from the stochastic character mapping# add new colours to cols_habcols_hab2 <-c(cols_hab, '#00ffdc', '#032a63', '#1e5302')names(cols_hab2) <-c(names(cols_hab), 'freshwater:terrestrial', 'freshwater:mud_and_shore', 'mud_and_shore:terrestrial')# lets plot the nodepies for a subset of the tree. We create a plotly of the bamm transitions and identified a clade with lots of transitions in# lets try on a subset of the data for now. Lets try using all tips from node 4416, where we are pretty confident that a transition occurred in at some point# replace bootstrap value with node number so we can find it again# rename treetree2 <- tree# keep tree data for nodesd_tree <-as_tibble(tree2) %>%filter(!str_detect(label, 'otu'))# check node label is the same as the order in the dataframesum(d_tree$label == tree2$node.label) ==length(tree2$node.label)# replace node label with node numbertree2$node.label <- d_tree$node# subset treetree_sub <-extract.clade(tree2, 4416)# subset tree sub data to just keep nodesd_tree_sub <-as_tibble(tree_sub) %>%filter(!str_detect(label, 'otu'))# tree_sub$node.label <- filter(d_tree, node %in% d_tree_sub$label)# replace node labels in acr dataset with new node numbersd_acr_phytools_sub <-filter(d_acr_phytools, node %in% d_tree_sub$label) %>%pivot_wider(., names_from ='habitat_preference', values_from ='probability_phytools')d_acr_phytools_sub$node <- d_tree_sub$node# grab the relevant shift nodes and convert them to the right number for the subsetted treeshiftnodes_sub <- shiftnodes[shiftnodes %in% d_tree_sub$label]shiftnodes_sub <-tibble(shiftnode_original =as.character(shiftnodes_sub)) %>%left_join(select(d_tree_sub, node, shiftnode_original = label))# create node pies for each node# https://github.com/YuLab-SMU/ggtree/issues/419#issuecomment-877563385# cols_hab[1:3][sort(names(cols_hab[1:3]))]pies <-nodepie(d_acr_phytools_sub, cols =2:ncol(d_acr_phytools_sub), color = cols_hab2[sort(names(cols_hab2))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = habitat_preference), size =2, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab2, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +scale_fill_manual('Habitat preference', values = cols_hab2, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)# plot treep_acr <- tree_plot_sub %<+% d_pie +geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.png'), p_acr, height =10, width =10)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.pdf'), p_acr, height =10, width =10)p_acr# change probabilities to just be marine, freshwater or terrestriald_acr_single <-filter(d_acr_phytools, habitat_preference %in%c('terrestrial', 'mud_and_shore', 'freshwater'))d_acr_double <-filter(d_acr_ape, str_count(habitat_preference, ':') ==1) %>%separate_rows(habitat_preference, sep =':') %>%mutate(probability_ape = probability_ape/2)d_acr_general <-filter(d_acr_ape, habitat_preference =='generalist') %>%group_by(node) %>%do(data.frame(habitat_preference =c('terrestrial', 'mud_and_shore', 'freshwater'), probability_ape = .$probability_ape/3)) %>%ungroup()# combine to have just three habitatsd_acr_ape2 <-bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%group_by(node, habitat_preference) %>%summarise(probability_ape =sum(probability_ape), .groups ='drop') %>%pivot_wider(names_from ='habitat_preference', values_from ='probability_ape')pies <-nodepie(d_acr_phytools_sub, cols =2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size =2, shape =21, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +scale_fill_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)
We can see that lots of the shift changes are associated with nodes where the model is confident in the ancestral state (either freshwwater + terrestrial or marine mud). To me there is not clear evidence that the nodes where shifts are thought to occur are in areas where a transition is likely to have occurred.
An alternative approach here is to not only plot each monomorphic state in the pie charts at the node, and split each polymorphic state into the probability of it either being marine mud, terrestrial, or freshwater. I do not think I favour this approach in the long-term, but it produces a graph that is slightly more consistent with our previous colour schemes.
Code
# change probabilities to just be marine, freshwater or terrestriald_acr_single <-filter(d_acr_phytools, habitat_preference %in%c('terrestrial', 'mud_and_shore', 'freshwater'))d_acr_double <-filter(d_acr_phytools, str_count(habitat_preference, ':') ==1) %>%separate_rows(habitat_preference, sep =':') %>%mutate(probability_phytools = probability_phytools/2)d_acr_general <-filter(d_acr_phytools, habitat_preference =='generalist') %>%group_by(node) %>%do(data.frame(habitat_preference =c('terrestrial', 'mud_and_shore', 'freshwater'), probability_phytools = .$probability_phytools/3)) %>%ungroup()# combine to have just three habitatsd_acr2 <-bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%group_by(node, habitat_preference) %>%summarise(probability_phytools =sum(probability_phytools), .groups ='drop') %>%filter(., node %in% d_tree_sub$label) %>%pivot_wider(names_from ='habitat_preference', values_from ='probability_phytools')d_acr2$node <- d_tree_sub$nodepies <-nodepie(d_acr2, cols =2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr2$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size =2, shape =21, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +scale_fill_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)# plot treep_acr <- tree_plot_sub %<+% d_pie +geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.png'), p_acr, height =10, width =10)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.pdf'), p_acr, height =10, width =10)p_acr
14 How to compare BAMM results to ancestral state reconstructions
A key question here is how to link the BAMM analysis to these ancestral state reconstructions. One idea I had was to compare all the nodes which have rate shifts to the rest of the nodes in terms of marginal probabilities of ancestral states.
We can pretty easily do this by filtering the data on marginal state probabilities by whether they have been identified as a shift node or not.
We probably need to control for time from tip in this analysis somehow. Obviously deeper tips are more likely to be less certain in the evolutionary model.
This plot shows a density plot for the marginal probability of each node for each state (habitat preference) split by whether the node was a shift node or not.
If there is no difference in ancestral state reconstruction between shift nodes and non-shift nodes, all the distributions should overlay on each other.
Code
# check whether internal node has a shift in or notd_acr_phytools <-mutate(d_acr_phytools, shift =ifelse(node %in% shiftnodes, 'shift', 'noshift'))ggplot(d_acr_phytools, aes(probability_phytools)) +geom_density(aes(col = shift, fill = shift, probability_phytools, ..scaled..), alpha =0.1) +facet_wrap(~habitat_preference, scales ='free') +scale_fill_manual(values =c('black', 'red')) +scale_color_manual(values =c('black', 'red')) +theme_bw()
Most of the panels overlay each other pretty consistently. However a couple of them do not (i.e. terrestrial and freshwater + terrestrial). The problem is I do not really understand how to interpret this at the moment and think it is kind of meaningless. For example, the panel for freshwater + terrestrial says that shift nodes are predicted to contain more freshwater + terrestrial. But this is true even when the probability of being freshwater + terrestrial in that node is 0.25.
Think how to cleverly look at these things is something I would really like to discuss further.
15 Useful links used
Chapter 13 of Luke Harmon’s book on phylogenetic comparative methods on “Characters and diversification rates”.
Documentation of the R package ggtree used for plotting phylogenies.
Phylo-wiki help page on ancestral state reconstruction.
Github Issue on plotting pie charts for ancestral state reconstructions.
Source Code
---title: "Estimating ancestral states"author: "Daniel Padfield"date: last-modifiedformat: html: toc: true toc-depth: 2 toc-title: 'Contents' code-overflow: wrap code-fold: true code-tools: true self-contained: true self-contained-math: true number-sections: trueexecute: message: false warning: false fig.align: 'center'editor: visual---## OutlineUsing information on the tip states of a phylogenetic tree we can use evolutionary models to estimate the ancestral states of internal nodes and transitions between states.We fit an $Mk$ model of evolution to our tree and habitat preference data to understand how habitat preference evolved. For example does evolution readily allow transitions between specialist habitat preferences (i.e. from marine mud only to terrestrial only).We also try and fit an $Mk$ model of evolution when we combine two traits together (habitat preference and whether or not the species has high or low diversification rate as identified by BAMM).Ancestral state reconstructions can also be estimated from the MuSSE and HiSSE models we are aiming to fit later on. These models would have a lot of free parameters and we are unsure whether they can be fit with the dataset we have. We hope to reduce the number of free parameters in those models by estimating the transition matrix which best fits the data in the $Mk$ model. This would reduce the number of free parameters.## TL;DRGo to @sec-best_model for the best $Mk$ model and a discussion of the transition matrix.## Progress and results- Done ancestral state reconstructions using a variety of methods: - **ape::ace()** - **castor::asr_mk_model()** - **diversitree::fit_mk()** - **phytools::make.simmap()**- Have written a quick primer to [BayesTraits](https://padpadpadpad.github.io/BayesTraitsAcrAllNodes/) but have not used it on our tree due to the time it takes to run the analysis with a transition matrix where all states are different.- Have compared the three standard models of evolution for the transition matrix (equal rates, symmetric rates, and all rates different) using **ape::ace()** and the all-rates different method is best. Then did model simplification to find the best model for the transition matrix.- Have plotted the ancestral state reconstructions of a subset of the tree. For the whole tree my computer gave up.- Have plotted the transition matrix in multiple ways, see @sec-best-model## Fitting transition rate models and doing ancestral state reconstructionsWe can do ancestral state reconstruction with a couple of different methods in R as well as trying in [BayesTraits](http://www.evolution.reading.ac.uk/BayesTraitsV4.0.0/BayesTraitsV4.0.0.html).For discrete traits, the most commonly used model for evolution on tree is called the $Mk$ model. $M$ stands for Markov because the modelled process is a continuous-time Markov chain, and $k$ because the model is generalised to include an arbitrary number ($k$) states.The central attribute of the $Mk$ model is a the transition matrix, $Q$, which gives the instantaneous transition rates between states.An important distinction in ancestral state reconstruction for discrete characters is joint vs. marginal reconstruction. Joint reconstruction is finding the set of character states at all nodes that maximise the likelihood. Marginal reconstruction is finding the state at the current node that maximises the likelihood integrating over all the other states at all nodes, in proportion to their probability.We can do both marginal reconstruction using **ape::ace()** and **castor::asr_mk_model()** and joint reconstruction using **ape::ace()**.An alternative tactic to the one outlined above is to use an MCMC approach to sample character histories from their posterior probability distribution. This is called stochastic character mapping. The model is the same but in this case we get a sample of unambiguous histories for our discrete character's evolution on the tree - rather than a probability distribution for the character at nodes.Finally we can use BayesTraits to do both maximum likelihood and MCMC approaches. We do not use this approach in this walk-through.The evolutionary models that are used to reconstruct ancestral states have a transition matrix between discrete states that define the rates of transitions between states on the tree. In most of these models, there are three that are easy to define: **ARD** means all transition rates are different, **SYM** means transitions to and from the same states are the same, and **ER** is an equal rates model where all rates between states are the same. These transition matrices can also be custom made but for now these three seem sensible starting points. Transition matrices also underpin the MuSSE and BiSSE models we are hoping to fit.## Load in R packagesFirst we will load in R packages used and the metadata file used and wrangled in a previous walk-through.We also load in the final phylogenetic tree, the colours used for habitat preferences, and the position of shift nodes.```{r load_packages}#| results: false# load packageslibrary(here)library(tidyverse)library(ggtree)library(ggnewscale)library(RColorBrewer)library(patchwork)library(ape)library(phytools)library(BAMMtools)library(coda)library(MetBrewer)library(ggpp)library(castor)library(diversitree)library(tidygraph)library(igraph)library(ggraph)library(GGally)library(ggrepel)# set where I am in the projecthere::i_am('scripts/sequencing_rpoB/analyses/estimating_transition_rates.qmd')# read in metadatad_meta <-read.csv(here('data/sequencing_rpoB/processed/asv_metadata.csv'))# read in habitat trait colourscols_hab <-readRDS(here('data/sequencing_rpoB/processed/habitat_colours.rds'))# read in treetree <-read.tree(here('data/sequencing_rpoB/bamm/rerooted-pruned-chronopl10.tre'))# edit tree labelsd_labels <-data.frame(tip_label = tree$tip.label) %>%separate(., tip_label, c('part1', 'part2', 'part3'), sep ='_', remove =FALSE) %>%unite('tip_label_new', c(part1, part2), sep ='_')tree$tip.label <- d_labels$tip_label_new# read in shift nodesshiftnodes <-readRDS(here('data/sequencing_rpoB/processed/shiftnodes.rds'))```## Custom functionWe will write a custom function for getting the transition matrix into a format that is easy to plot using **ggplot2**.```{r custom_functions}# get dataframe of transition ratesace_transition_df <-function(ace_object){ temp_1 <- ace_object$index.matrixcolnames(temp_1) <-rownames(temp_1) <-colnames(ace_object$lik.anc)# turn into dataframe temp_1 <-as_tibble(temp_1, rownames ='state_1') %>%# make long format, the column names become state 2pivot_longer(freshwater:terrestrial, names_to ='state_2', values_to ='order') %>%mutate(free_param =ifelse(is.na(order)|order==0, 'no', 'yes'),num_params =length(ace_object$rates)) temp_1$transition_rate <-NA temp_1$transition_rate[temp_1$order ==0] =0# run for loop to input estimated transition ratesfor(i in1:length(ace_object$rates)){ temp_1[temp_1$order %in% i,]$transition_rate <- ace_object$rates[i] }return(select(temp_1, -order, state_1, state_2, transition_rate, free_param, num_params))}```## Create a master dataframe for the trait statesDifferent methods for estimating ancestral states take different values for the traits. Some only allow for numeric values, whereas others allow characters. Some do not allow colons in the name, and one ([MBASR](https://github.com/stevenheritage/MBASR) which uses Mr Bayes) needs the first trait value to be 0. To allow for us to semi-easily link across different methods, we will make a data frame of our unique tip states and then turn them into a numeric vector. The ordering is all done alphabetically.```{r tip_state_dataframe}d_meta <-tibble(tip_label = tree$tip.label) %>%left_join(., d_meta)# make sure order of habitat preference is the same in the trait vector as the tip labelssum(d_meta$tip_label == tree$tip.label) ==length(tree$tip.label)# SUCCESS if TRUE# replace NA of outgroup - make it freshwater:terrestrial - the most common state# phytools::make.simmap cannot take NAshab_pref <-setNames(d_meta$habitat_preference, d_meta$tip_label)hab_pref[is.na(hab_pref)] <-'freshwater:terrestrial'# make habitat preference vector numerichab_pref_num <-as.numeric(as.factor(hab_pref)) -1hab_pref_num <-setNames(hab_pref_num, d_meta$tip_label)hab_pref_num2 <- hab_pref_num +1# coding from numeric to charactercoding <-tibble(hab_pref =unname(hab_pref), hab_pref_num =unname(hab_pref_num)) %>%distinct() %>%mutate(hab_pref2 =gsub(':', '.', hab_pref),hab_pref_num2 = hab_pref_num +1) %>%arrange(hab_pref) %>%mutate(hab_pref_axis =gsub(':', '/ ', hab_pref),hab_pref_axis =gsub('_', ' ', hab_pref_axis),# rename the columns for easy naminginitials =c('F', 'FM', 'FT', 'G', 'M', 'MT', 'T'))coding# save out habitat preference vectorsaveRDS(hab_pref, here("data/sequencing_rpoB/processed/hab_pref_vec.rds"))```This results in a dataframe which allows us to easily link between numeric and character values, and also gives us a vector - of the same length of tips in our tree - which we can use for estimating ancestral states.## Estimate ancestral states using ape::ace()First lets fit each of the models (ER, SYM, and ARD) using **ape::ace()**. We can compare these models using AIC and likelihood ratio tests to see which model fits the tree best.```{r ape_1_save}#| warnings: false#| message: false#| eval: false#| echo: false# with apeancestral_states_ape <-ace(hab_pref, tree, model="ER", type="discrete")ancestral_states_ape1 <-ace(hab_pref, tree, model="SYM", type="discrete")ancestral_states_ape2 <-ace(hab_pref, tree, model="ARD", type="discrete")# save out filessaveRDS(ancestral_states_ape, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape_mac.rds'))saveRDS(ancestral_states_ape1, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape1_mac.rds'))saveRDS(ancestral_states_ape2, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape2_mac.rds'))``````{r ape_1_load}#| echo: false# read in filesancestral_states_ape <-readRDS(here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape_mac.rds'))ancestral_states_ape1 <-readRDS(here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape1_mac.rds'))ancestral_states_ape2 <-readRDS(here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape2_mac.rds'))``````{r ape_1_norun}#| eval: false# do ancestral reconstruction with apeancestral_states_ape <-ace(hab_pref, tree, model="ER", type="discrete")ancestral_states_ape1 <-ace(hab_pref, tree, model="SYM", type="discrete")ancestral_states_ape2 <-ace(hab_pref, tree, model="ARD", type="discrete")```We can do model comparison using both AIC scores and likelihood ratio tests.```{r ape_1_compare}# compare modelsanova(ancestral_states_ape, ancestral_states_ape1)anova(ancestral_states_ape1, ancestral_states_ape2)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2)))mod_compare```From these comparisons we can see that the ARD model, where all rates in the transition rate are allowed to be different, is the best model in terms of both doing likelihood ratio tests of nested models and AIC scores.### Do some rudimentary checks of ape::ace()We can grab the ancestral states from the output of **ace()**. We can do some checks which include checking whether any of the transition rates are negative or whether they have any standard errors that are `NaN`, and whether any of the node states have probabilities \> 1. This hints the model is too complex/has not converged/not doing a good job.We can format the transition matrix from the output of **ace::ape()** to look at the rates. As it is not stated in the **ape::ace()** directory, we assume the rates are entered as in **castor::asr_mk_model()**, where the \[r,c\]-th entry is the transition rate from state r to state c. This is confirmed on this [help page](https://www.r-phylo.org/wiki/HowTo/Ancestral_State_Reconstruction#Reconstructing_Ancestral_States_for_Discrete_Variables).```{r check_ape_ace}# see whether any of the SEs are NA. This hints that its not converged properlyancestral_states_ape$seancestral_states_ape1$seancestral_states_ape2$se# as you can see both the symmetric and all rates different models have NaNs for some/all of the transitions# get marginal likelihood of each ancestral state at each noded_acr_ape <- ancestral_states_ape2$lik.anc %>%as_tibble(rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_ape', names_to ='habitat_preference')d_acr_ape_summary <-group_by(d_acr_ape, node) %>%summarise(total =sum(probability_ape), .groups ='drop') %>%filter(total !=1)nrow(d_acr_ape_summary)# apparently 7 nodes that do not equal 1, but they are very close to 1.```Some of the standard errors for the ARD model are `NaN`, suggesting convergence problems. However, there are only seven internal nodes with probabilities different from 1, so these estimates are not completely crazy.### Plotting transition matricesWe can plot the transition matrices to see what the estimated rates are. This uses our custom function **ace_transition_df()**.The transition matrices plotted here can be interpreted as follows:- the transition rates are from state 1 (on the y axis) to state 2 (on the x axis)- diagonal rates are when state 1 equals state 2 and are therefore not estimated- Red boxes indicate rates which are estimated. This will be useful when we start using custom matrices.```{r plot_transition_matrices}#| fig.height: 5#| fig.width: 12# bind together the transition matrices from each of the standard models for the transition matricesd_models <-bind_rows(ace_transition_df(ancestral_states_ape) %>%mutate(model ='equal rates'),ace_transition_df(ancestral_states_ape1) %>%mutate(model ='symmetric'),ace_transition_df(ancestral_states_ape2) %>%mutate(model ='all rates different')) %>%# add in colums for reordering and naming of axesleft_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2))# plot transition matrices# make heatmap of changesggplot(d_models, aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate), col ='black', width =0.9, height =0.9) +geom_tile(aes(alpha = transition_rate), fill =NA, col ='red', size =1.1, filter(d_models, free_param =='yes'), width =0.9, height =0.9) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5)) +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2') +coord_fixed() +facet_wrap(~model)```### Doing model simplification on the all rates different modelSo we have seen that the all rates different model is better than the SYM model, but we can also use the results of the ARD model to make some of the rates 0.We can view the all rates different transition matrix only.```{r ard_transition_matrix}#| fig.height: 5#| fig.width: 7# ARD transition matrix onlyfilter(d_models, model =='all rates different') %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate), col ='black', width =0.9, height =0.9) +geom_tile(aes(alpha = transition_rate), fill =NA, col ='red', size =1.1, filter(d_models, free_param =='yes'& model =='all rates different'), width =0.9, height =0.9) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2') +coord_fixed()```The transitions where the rates are estimated to be 0 are:- freshwater to marine mud + terrestrial- freshwater + terrestrial to marine mud- generalist to marine mud- marine mud to freshwater + terrestrial- marine mud to generalist- terrestrial to freshwater + marine mud- terrestrial to generalistWe can create a custom matrix and fix these parameters to be 0.```{r custom_matrix}num_states <-unique(hab_pref) %>%length()# we will set up the custom matrix, this has 49 numbers and then we have to set the correct numbers to 0custom_matrix <-matrix(1:num_states^2, nrow=num_states)# make all diagonal numbers NAfor(i in1:nrow(custom_matrix)){ custom_matrix[i,i] <-0}# make specific transitions 0 based on the output of the ARD.custom_matrix[custom_matrix %in%c(14, 19, 26, 28, 31, 32, 36)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix[custom_matrix !=0] <-1:length(custom_matrix[custom_matrix !=0])custom_matrix``````{r ape_2_save}#| eval: false#| echo: false# lets try refit the model with these parameters fixed to zeroancestral_states_ape3 <-ace(hab_pref, tree, model=custom_matrix, type="discrete")saveRDS(ancestral_states_ape3, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape3_mac.rds'))```We can now try and refit the $Mk$ model to the tree and habitat preference data, but using our custom transition matrix.```{r ape_2_load}#| echo: falseancestral_states_ape3 <-readRDS(here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape3_mac.rds'))``````{r ape_2_norun}#| eval: false# lets try refit the model with these parameters fixed to zeroancestral_states_ape3 <-ace(hab_pref, tree, model=custom_matrix, type="discrete")```We can now do model simplification, making sure we compare the simpler model (with fewer free parameters) to the more complex model (the all rates different model). We can also compute the AIC and add it to our table.```{r ape_2_compare}# compare the less complicated model with the more complicated modelanova(ancestral_states_ape3, ancestral_states_ape2)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3)))arrange(mod_compare, aic)```So the model where we explicit remove 0s from the model fitting process fits the model loads better (lower AIC, anova p value = 1 indicating the extra parameters - the ARD model - do not help explain the data any better).Lets now look at the transition matrix from the most recent model to see what can be removed.```{r plot_transition_matrix}#| fig.height: 5#| fig.width: 7# plot transition matrixace_transition_df(ancestral_states_ape3) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(custom_matrix[custom_matrix !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))```The black boxes with 0 in are rates that we fixed to be 0. There are now no other rates that are 0. We will remove the next lowest rate to see whether its presence significantly improves the model fit. This is freshwater + terrestrial to marine mud + terrestrial.```{r custom_matrix2}custom_matrix# input the rates into the custom matrixcustom_matrix2 <- custom_matrix# set some more rates to 0custom_matrix2[custom_matrix2 %in%c(15,20,25,26,31,33)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix2[custom_matrix2 !=0] <-1:length(custom_matrix2[custom_matrix2 !=0])``````{r ape_3_save}#| eval: false#| echo: false# lets try refit the model with these parameters fixed to zeroancestral_states_ape4 <-ace(hab_pref, tree, model=custom_matrix2, type="discrete")saveRDS(ancestral_states_ape4, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape4_mac.rds'))``````{r ape_3_load}#| echo: falseancestral_states_ape4 <-readRDS(here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape4_mac.rds'))```We can now re run the $Mk$ model.```{r ape_3_norun}#| eval: false# lets try refit the model with these parameters fixed to zeroancestral_states_ape4 <-ace(hab_pref, tree, model=custom_matrix2, type="discrete")```We can do model selection as before, but now this new model (fewer parameters) is compared to the previous custom matrix model (more parameters).```{r ape_3_compare}#| fig.height: 5#| fig.width: 7# compare the less complicated model with the more complicated modelanova(ancestral_states_ape4, ancestral_states_ape3)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4)))arrange(mod_compare, aic)# plot transition matrixace_transition_df(ancestral_states_ape4) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(custom_matrix2[custom_matrix2 !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))```These zeroes somehow make the model loads worse, so our best model is apparently **ancestral_states_ape3**. When plotting this model we can see it is doing an extremely bad job.Instead we can try and refit **ancestral_states_ape3** with values \<0.1 set to 0. This will hopefully allow us to move past this problem.```{r ape_4}#| fig.height: 5#| fig.width: 7# create new custom matrixcustom_matrix3 <- custom_matrix# get index of rates which are <0.1which(ancestral_states_ape3$rates <0.1)custom_matrix3[custom_matrix3 %in%which(ancestral_states_ape3$rates <0.1)] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix3[custom_matrix3 !=0] <-1:length(custom_matrix3[custom_matrix3 !=0])# fit habitat preference modelancestral_states_ape5 <-ace(hab_pref, tree, model=custom_matrix3, type="discrete")anova(ancestral_states_ape5, ancestral_states_ape3)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5)))arrange(mod_compare, aic)# plot transition matrixace_transition_df(ancestral_states_ape5) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(custom_matrix3[custom_matrix3 !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))```There are no longer any rates less than 0.1. Instead we will remove the lowest rate and refit the model to see if it is better or worse.```{r ape5}#| fig.height: 5#| fig.width: 7custom_matrix4 <- custom_matrix3custom_matrix4[custom_matrix4 %in%which(ancestral_states_ape5$rates ==min(ancestral_states_ape5$rates))] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix4[custom_matrix4 !=0] <-1:length(custom_matrix4[custom_matrix4 !=0])ancestral_states_ape6 <-ace(hab_pref, tree, model=custom_matrix4, type="discrete")anova(ancestral_states_ape6, ancestral_states_ape5)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5', 'ancestral_states_ape6'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5), AIC(ancestral_states_ape6)))arrange(mod_compare, aic)# plot transition matrixace_transition_df(ancestral_states_ape6) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(custom_matrix4[custom_matrix4 !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))```This parameter was not significant in the model, and we can got an improvement in the AIC score. So this parameter can be removed from the model.We will do this process again.```{r ape6}#| fig.height: 5#| fig.width: 7custom_matrix5 <- custom_matrix4custom_matrix5[custom_matrix5 %in%which(ancestral_states_ape6$rates ==min(ancestral_states_ape6$rates))] <-0# replace values of non-zeroes with correct vector of 1:n parameterscustom_matrix5[custom_matrix5 !=0] <-1:length(custom_matrix5[custom_matrix5 !=0])ancestral_states_ape7 <-ace(hab_pref, tree, model=custom_matrix5, type="discrete")anova(ancestral_states_ape7, ancestral_states_ape6)# check AIC of each modelmod_compare <-tibble(mod =c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5', 'ancestral_states_ape6', 'ancestral_states_ape7'),aic =c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5), AIC(ancestral_states_ape6), AIC(ancestral_states_ape7)))arrange(mod_compare, aic)# plot transition matrixace_transition_df(ancestral_states_ape7) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) %>%ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(custom_matrix5[custom_matrix5 !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))```Ok so this parameter is important to the model. So our best model is **ancestral_states_ape6** according to our model simplification process.We will save this out and also refit the model in **castor** and **diversitree** to see how they compare.## Estimate ancestral states with castor::asr_mk_model()We can do an equivalent analysis using **castor::asr_mk_model()**. We can then compare the transition rates and ancestral reconstruction estimates.We will only fit the model with the best custom transition matrix to compare estimates.```{r acr_castor}saveRDS(ancestral_states_ape6, here('data/sequencing_rpoB/processed/transition_rates/ancestral_states_ape6_mac.rds'))# rename best transition matrix for using it laterbest_matrix <- custom_matrix4rownames(best_matrix) <-colnames(best_matrix) <-sort(coding$hab_pref)saveRDS(best_matrix, here('data/sequencing_rpoB/processed/transition_rates/best_matrix_mk_mac.rds'))# get dataframe of transition matrixd_best_matrix <-ace_transition_df(ancestral_states_ape6) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) # with castorancestral_state_castor <-asr_mk_model(tree, hab_pref_num2,Ntrials =5,rate_model = best_matrix,reroot =FALSE)# check AICancestral_state_castor$AICancestral_state_castor$loglikelihood```We can compare the transition rate estimates.```{r ace_castor_transition}#| fig-height: 5#| fig-width: 7# get the castor matrixd_castor_matrix <- ancestral_state_castor$transition_matrixrownames(d_castor_matrix) <-colnames(d_castor_matrix) <-sort(coding$hab_pref)d_castor_matrix <-as_tibble(d_castor_matrix, rownames ='state_1') %>%pivot_longer(freshwater:terrestrial, names_to ='state_2', values_to ='transition_rate_castor')d_transition_compare <-left_join(d_castor_matrix, d_best_matrix) # plotfilter(d_transition_compare, free_param =='yes') %>%ggplot(., aes(transition_rate, transition_rate_castor)) +geom_point(size =2) +theme_bw() +labs(x ='transition rate ape::ace()',y ='transition rate castor::asr_mk_model') +geom_abline(aes(slope =1, intercept =0))```The values for **ape::ace()** are very similar to those from **castor::asr_mk_model()**. This is good. There is one shift that is estimated to be 0 using **castor**, but it is \> 2 for **ape**. This is the transitionWe can see if this result is the same for **diversitree**, and hints this parameter could be set to 0. However right now we will not do that.## Estimate ancestral states with diversitreeWe can also do the equivalent analysis using **diversitree**. This might be beneficial because it can also be used for BiSSE models later on.It is a bit more involved in terms of defining the constraints on the transition matrix.```{r acr_diversitree}# all of these methods needs a likelihood function, we can build a Mkn modellik <-make.mkn(tree, hab_pref_num2, k =max(hab_pref_num2))# constrain correct parameters to be 0lik_constrain <-constrain(lik, q14~0, q16~0, q17~0, q26~0, q27~0, q35~0, q36~0, q45~0, q47~0, q51~0, q52~0, q53~0, q54~0, q57~0, q63~0, q64~0, q71~0, q72~0, q74~0, q75~0)argnames(lik_constrain)# need to pass start values to it - can grab these from the ape::ace, but we will just pass an average rate to the modelinits <-rep(mean(ancestral_states_ape6$rates), length(argnames(lik_constrain)))# find the maximum likelihood estimates of this modelmod_ml <-find.mle(lik_constrain, inits)# pass inits from the ancestral states model#inits2 <- ancestral_states_ape6$rates#mod_mcmc <- mcmc(lik_constrain, inits2, nsteps = 10, w = 1)unname(mod_ml$par)# save out diversitree modelsaveRDS(mod_ml, here("data/sequencing_rpoB/processed/diversitree_mk_model_mac.rds"))```We can compare the rates estimated from **ape::ace()** and **diversitree::find.mle()** by putting putting the output of diversitree into a dataframe and combining it with our current comparison dataframe.```{r compare_ace_findmle}#| fig-height: 5#| fig-width: 7# grab names of q matrices from diversitreeq_matrix_diversitree <-names(mod_ml$par)# create dataframe with theseq_matrix <-tibble(q_matrix_diversitree = q_matrix_diversitree) %>%# grab first and second elements that are the habitat codingsmutate(state_1_num =as.numeric(substr(q_matrix_diversitree, 2,2)),state_2_num =as.numeric(substr(q_matrix_diversitree, 3,3)),transition_rate_diversitree =unlist(mod_ml$par)) %>%left_join(select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2)) %>%left_join(select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2)) %>%select(-state_1_num, -state_2_num, -q_matrix_diversitree)# combine into comparison dataframed_transition_compare <-left_join(d_transition_compare, q_matrix) %>%select(state_1, state_2, transition_rate, transition_rate_castor, transition_rate_diversitree) %>%mutate(across(starts_with('transition'), round, 2))# plot to see similarityggplot(d_transition_compare, aes(transition_rate, transition_rate_diversitree)) +geom_abline(aes(intercept =0, slope =1)) +geom_point(size =2) +theme_bw()```We can see the estimates are equivalent! This is great news! However the model convergence code is also -1 which hints that the model has not converged. So it is complicated.## Estimate ancestral states using stochastic character mapping using phytools::make.simmap()An alternative approach is to do stochastic mapping of the ancestral states. What this does is it calculates the conditional likelihood for each character state at each node of the tree, including the root, and then simulate ancestral states at each internal node by sampling from the posterior distribution of states.We need to do some initial changes to our transition matrix which we can pass to **make_simmap()** so the $M_{k}$ model is not fitted by this model. Importantly we need to make sure the rows add up to 0, so we need to make the diagonal value of the transition matrix be the negative of the sum of each row.```{r phytools_matrix}# rename best transition matrix for using it laterbest_matrix <- custom_matrix4rownames(best_matrix) <-colnames(best_matrix) <-sort(coding$hab_pref)# get dataframe of transition matrixd_best_matrix <-ace_transition_df(ancestral_states_ape6) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) # get best matrixbest_matrix_estimates <-select(d_best_matrix, state_1, state_2, transition_rate) %>%spread(state_2, transition_rate) %>%column_to_rownames(var ='state_1') %>%mutate(across(everything(), replace_na, 0)) %>%as.matrix()# make diagonal values make things sum to 0diag(best_matrix_estimates) <--rowSums(best_matrix_estimates)```We can now run **make.simmap()** on our tree. We will run it for 100 simulations for now to reduce object space.```{r acr_phytools_save}#| eval: false#| echo: false# do stochastic mapping of character traits using phytools# feed in the best transition matrix previously found using other methodsancestral_states_phytools <-make.simmap(tree, hab_pref, nsim =100, Q = best_matrix_estimates)saveRDS(ancestral_states_phytools, here("data/sequencing_rpoB/processed/transition_rates/simmap_mac.rds"))``````{r acr_phytools_load}#| echo: falseancestral_states_phytools <-readRDS(here("data/sequencing_rpoB/processed/transition_rates/simmap_mac.rds"))``````{r acr_phytools_norun}#| eval: false# do stochastic mapping of character traits using phytools# feed in the best transition matrix previously found using other methodsancestral_states_phytools <-make.simmap(tree, hab_pref, nsim =100, Q = best_matrix_estimates)```It has finished! We can count the number of transitions between each state using **describe.simmap()**. We can then plot these to see the distribution of **numbers of transitions**.```{r acr_phytools_transitions}#| fig-height: 8#| fig-width: 10# summarise number of switches between states and time spent in each statesimmap_summary <-describe.simmap(ancestral_states_phytools, plot=FALSE)# coerce transitions into dataframed_transitions <-as.data.frame(simmap_summary$count, col.names =colnames(simmap_summary$count)) %>%pivot_longer(cols =c(everything(), -N), names_to ='transition', values_to ='n_transitions') %>%separate(transition, c('state_1', 'state_2'), sep =',', remove =FALSE) %>%mutate(label =gsub(',', ' -> ', transition))# plot thesegroup_by(d_transitions, transition) %>%filter(max(n_transitions) >0) %>%mutate(average =mean(n_transitions)) %>%ungroup() %>%mutate(label =fct_reorder(label, desc(average))) %>%ggplot() +geom_histogram(aes(n_transitions), fill ='white', col ='black', binwidth =function(x) (max(x)-min(x))/nclass.FD(x)) +facet_wrap(~ label, scales ='free_x') +theme_bw() +labs(title ='Distribution of number of transitions between states',subtitle ='Facets are ordered by common transitions',x ='Number of transitions',y ='Count')```We can see the number of transitions between states is dominated by the terrestrial and freshwater environments, with relatively few transitions involving the marine mud environment.We can also look at how was spent in each state in each character map iteration.```{r acr_phytools_time}#| fig-height: 8#| fig-width: 10# coerce time into dataframed_time <-as.data.frame(simmap_summary$times, col.names =colnames(simmap_summary$times)) %>%mutate(n =1:n()) %>%select(-total) %>%pivot_longer(cols =c(everything(), -n), names_to ='state', values_to ='time') %>%group_by(n) %>%mutate(prop = time/sum(time)) %>%ungroup()# plot thesegroup_by(d_time, state) %>%mutate(average =mean(prop)) %>%ungroup() %>%mutate(state =fct_reorder(state, desc(average))) %>%ggplot() +geom_histogram(aes(prop), fill ='white', col ='black', binwidth =function(x) (max(x)-min(x))/nclass.FD(x)) +facet_wrap(~ state, scales ='free_x') +theme_bw() +labs(title ='Proportion of time spent in each state',subtitle ='Facets are ordered in terms of time spent',x ='Proportion of time spent in state',y ='Count')```From this plot, we can see that the tree is dominated by four states: freshwater+terrestrial, terrestrial, marine mud, and freshwater. However, what is interesting is that even though marine mud is the third most common state, it is involved in very relatively few transitions. Transitions out of being a marine mud specialist only occur into freshwater+marine mud, and this is the second least common transition.We can also compare the ancestral state estimates garnered from the best **ape::ace()** model and those from the average stochastic character map. These are likely to be extremely similar given we fixed the Q matrix in our stochastic character mapping process to save time.```{r acr_phytools_ace_castor}#| fig.height: 8#| fig.width: 9# grab ancestral state from simmapd_acr_phytools <-as.data.frame(simmap_summary$ace, col.names =colnames(simmap_summary$ace), row.names =row.names(simmap_summary$ace)) %>%rownames_to_column(., var ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_phytools', names_to ='habitat_preference')# grab out marginal likelihood for each node from ape::ace()d_acr_ape <- ancestral_states_ape6$lik.anc %>%as_tibble(rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_ape', names_to ='habitat_preference')# grab out marginal likelihood for each node from castor::asr_mk_model()d_acr_castor <- ancestral_state_castor$ancestral_likelihoodscolnames(d_acr_castor) <-sort(coding$hab_pref)rownames(d_acr_castor) <-rownames(ancestral_states_ape6$lik.anc)d_acr_castor <-as_tibble(d_acr_castor, rownames ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_castor', names_to ='habitat_preference')# grab out marginal likelihood for each node from diversitree::find.mle()d_acr_diversitree <-asr.marginal(lik_constrain, unname(mod_ml$par)) %>%t() %>%as.data.frame(., col.names =colnames(ancestral_states_ape6$lik.anc), row.names =rownames(ancestral_states_ape6$lik.anc)) colnames(d_acr_diversitree) <-colnames(ancestral_states_ape6$lik.anc)d_acr_diversitree <- d_acr_diversitree %>%rownames_to_column(var ='node') %>%pivot_longer(cols =all_of(coding$hab_pref), values_to ='probability_diversitree', names_to ='habitat_preference')# look at similarity between ace, castor, and phytools approachesd_acr_summary <-reduce(list(d_acr_ape, d_acr_phytools, d_acr_castor, d_acr_diversitree), left_join)# can plot all of these reconstructions to see if they are doing similar things# will use ggpairsggpairs(d_acr_summary, columns =c('probability_diversitree', 'probability_ape', 'probability_castor', 'probability_phytools'),mapping = ggplot2::aes(colour = habitat_preference),lower =list(continuous =wrap("smooth", se =FALSE, alpha =0.3, size=0.1))) +theme_bw(base_size =14)```Stochastic character mapping and the maximum likelihood approach estimate ancestral states in slightly different ways. In **ape::ace()**, ancestral states are inferred using marginal state reconstruction, whereas the estimates from the stochastic character map use the joint reconstruction (I think).However, if you force the Q matrix to be maximum likelihood estimates then the estimates for the ancestral states should converge. This is what we see happening here. The diversitree estimates based on the marginal state reconstruction are equivalent to phytools. Something strange is going on with the **castor** estimates that I cannot work out, but three of our methods are extremely close with each other. This is good!## Looking at the best model for ancestral state reconstruction. {#sec-best_model}We can plot the best transition matrix using **ggplot2**.```{r best_transition_matrix}#| fig.height: 5#| fig.width: 7# rename best transition matrix for using it laterbest_matrix <- custom_matrix4# get dataframe of transition matrixd_best_matrix <-ace_transition_df(ancestral_states_ape6) %>%left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%mutate(transition_rate =round(transition_rate, 2)) # make heatmap of changesggplot(d_best_matrix, aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +geom_tile(aes(alpha = transition_rate, col = free_param), width =0.9, height =0.9, size =1.1) +theme_bw(base_size =14) +theme(panel.grid.major =element_blank(),legend.position ='none',axis.text.x.top =element_text(angle =90, vjust =0.5),plot.title.position ="plot") +scale_alpha_continuous(range =c(0, 0.6)) +geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +scale_x_discrete(position ='top', labels = scales::label_wrap(13)) +scale_y_discrete(position ='left', labels = scales::label_wrap(13)) +labs(y ='state 1',x ='state 2',title =paste('all rates different with', length(best_matrix[best_matrix !=0]), 'free parameters', sep =' ')) +coord_fixed() +scale_color_manual(values =c('black', 'red'))ggsave(here('plots/sequencing_rpoB/analyses/transition_matrix_mac.png'), last_plot(), height =6, width =8)```From this transition matrix, we can see some really cool patterns.- The best model has 17 free parameters, down from 42 in the all rates different model.- It is much more common to transition to being more specialist than being more generalist. - generalist -\> freshwater, generalist -\> freshwater/terrestrial, generalist -\> margine/terrestrial have high rates - freshwater/marine -\> freshwater has high rate - marine/terrestrial -\> marine has high rate - Only freshwater/terrestrial has a transition rate INTO generalist- Zero rates generally map to transitions across habitats (i.e. freshwater -\> terrestrial). - This is only non-zero in 1 out of 12 cases.We can have a look at these habitat preferences and their transition rates in a network, to visualise key stepping stones between states and highly connected trait states.We will do this using [ggraph](https://ggraph.data-imaginist.com/articles/Edges.html). I have not implemented this yet.First need to change our transition matrix into a network.```{r make_network_data}#| fig-height: 6#| fig-width: 8# set colourscols_hab <-met.brewer('Austria', n =7)names(cols_hab) <-c('mud_and_shore', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'generalist', 'mud_and_shore:terrestrial', 'freshwater:mud_and_shore')hab_labels <-c('marine mud', 'freshwater', 'terrestrial', 'freshwater + terrestrial', 'generalist', 'marine mud + terrestrial', 'freshwater + marine mud')# calculate mean time spent in each stated_timespent <-group_by(d_time, state) %>%summarise(mean =mean(prop), .groups ='drop')# turn transition matrix into network to plotd_network <-as_tbl_graph(select(d_best_matrix, state_1, state_2, transition_rate)) %>%activate(edges) %>%filter(!is.na(transition_rate) & transition_rate >0) %>%activate(nodes) %>%left_join(., select(d_timespent, name = state, mean)) %>%mutate(label =gsub(':', '/', name),label =gsub('_', ' ', label),label =gsub('mud and shore', 'marine mud', label),order =c(1, 2, 6, 7, 3, 4, 5),hab1 =gsub(':.*.', '', name),hab2 =gsub('.*:', '', name)) %>%arrange(order)p <-ggraph(d_network, layout ='linear', circular =TRUE) +geom_edge_fan(aes(alpha = transition_rate, width = transition_rate),arrow =arrow(length =unit(4, 'mm')),end_cap =circle(10, 'mm'),start_cap =circle(10, 'mm')) +geom_node_point(aes(size = mean,col = name)) +theme_void() +#geom_node_label(aes(label = label, x=xmin), repel = TRUE) +scale_size(range =c(2,20)) +scale_edge_width(range =c(0.5, 2)) +scale_color_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +scale_fill_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist'))# grab data for pointspoint_data <- p$data %>%select(x, y, label) %>%mutate(nudge_x =ifelse(x <0, -0.5, 0.5),nudge_y =ifelse(y <0, -0.3, 0.3),label =gsub('/', '/\n', label))p +geom_label(aes(nudge_x + x, nudge_y+y, label = label), point_data, size = MicrobioUoE::pts(18)) +theme(legend.position ='none',panel.background =element_rect(fill ='white', colour ='white')) +coord_cartesian(clip ="off") +xlim(c(min(point_data$x) +min(point_data$nudge_x) -0.3), max(point_data$x) +max(point_data$nudge_x) +0.3) +ylim(c(min(point_data$y) +min(point_data$nudge_y) -0.1), max(point_data$y) +max(point_data$nudge_y) +0.1)#ggsave(here('sequencing_rpoB/plots/analyses/transition_plot.pdf'), last_plot(), height = 6, width = 8)ggsave(here('plots/sequencing_rpoB/analyses/transition_plot_mac.png'), last_plot(), height =6, width =8)```This plot is AMAZING and has loads of useful biological interpretation.- Marine mud is a common state (larger point), but has very low transition rates away from this state (and can only move into freshwater + marine mud)- Generalist is a rare state and has high rates transitioning away from it.- Rates going from a more polymorphic state to a simpler, more specialised state are generally higher than rates going from a monomorphic to polymorphic state.- Freshwater and its associated polymorphic traits (e.g. freshwater + terrestrial, freshwater + marine mud) are important states that facilitate a lot of transitions between other states.- Only 1 out of 6 transitions from monomorphic states to each other is possible in the best model (freshwater -\> marine mud).## Plot ancestral state reconstructions using ggtreeWe can plot the ancestral states using methods from **ggtree**. For this we need seven colours, one for each state. This is because it is really important to know whether we are confident the ancestral state reconstruction thinks we are in a polymorphic trait, or whether it is stuck between two (or three), monomorphic states.The idea is that habitat transitions may result in a change in rate of speciation/diversification, so we can plot the nodes where transitions are most likely to occur over the tree to see if they align with any habitat transitions.\We tried this on the whole tree, but it broke my Mac so instead we are doing it on node 4416, which was identified as a parent of nodes which contained a high density of potential transitions.```{r plot_nodepies1}#| fig.height: 8#| fig.width: 9#| eval: false# grab out marginal likelihood for each node# will use results from the stochastic character mapping# add new colours to cols_habcols_hab2 <-c(cols_hab, '#00ffdc', '#032a63', '#1e5302')names(cols_hab2) <-c(names(cols_hab), 'freshwater:terrestrial', 'freshwater:mud_and_shore', 'mud_and_shore:terrestrial')# lets plot the nodepies for a subset of the tree. We create a plotly of the bamm transitions and identified a clade with lots of transitions in# lets try on a subset of the data for now. Lets try using all tips from node 4416, where we are pretty confident that a transition occurred in at some point# replace bootstrap value with node number so we can find it again# rename treetree2 <- tree# keep tree data for nodesd_tree <-as_tibble(tree2) %>%filter(!str_detect(label, 'otu'))# check node label is the same as the order in the dataframesum(d_tree$label == tree2$node.label) ==length(tree2$node.label)# replace node label with node numbertree2$node.label <- d_tree$node# subset treetree_sub <-extract.clade(tree2, 4416)# subset tree sub data to just keep nodesd_tree_sub <-as_tibble(tree_sub) %>%filter(!str_detect(label, 'otu'))# tree_sub$node.label <- filter(d_tree, node %in% d_tree_sub$label)# replace node labels in acr dataset with new node numbersd_acr_phytools_sub <-filter(d_acr_phytools, node %in% d_tree_sub$label) %>%pivot_wider(., names_from ='habitat_preference', values_from ='probability_phytools')d_acr_phytools_sub$node <- d_tree_sub$node# grab the relevant shift nodes and convert them to the right number for the subsetted treeshiftnodes_sub <- shiftnodes[shiftnodes %in% d_tree_sub$label]shiftnodes_sub <-tibble(shiftnode_original =as.character(shiftnodes_sub)) %>%left_join(select(d_tree_sub, node, shiftnode_original = label))# create node pies for each node# https://github.com/YuLab-SMU/ggtree/issues/419#issuecomment-877563385# cols_hab[1:3][sort(names(cols_hab[1:3]))]pies <-nodepie(d_acr_phytools_sub, cols =2:ncol(d_acr_phytools_sub), color = cols_hab2[sort(names(cols_hab2))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = habitat_preference), size =2, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab2, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +scale_fill_manual('Habitat preference', values = cols_hab2, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)# plot treep_acr <- tree_plot_sub %<+% d_pie +geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.png'), p_acr, height =10, width =10)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.pdf'), p_acr, height =10, width =10)p_acr# change probabilities to just be marine, freshwater or terrestriald_acr_single <-filter(d_acr_phytools, habitat_preference %in%c('terrestrial', 'mud_and_shore', 'freshwater'))d_acr_double <-filter(d_acr_ape, str_count(habitat_preference, ':') ==1) %>%separate_rows(habitat_preference, sep =':') %>%mutate(probability_ape = probability_ape/2)d_acr_general <-filter(d_acr_ape, habitat_preference =='generalist') %>%group_by(node) %>%do(data.frame(habitat_preference =c('terrestrial', 'mud_and_shore', 'freshwater'), probability_ape = .$probability_ape/3)) %>%ungroup()# combine to have just three habitatsd_acr_ape2 <-bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%group_by(node, habitat_preference) %>%summarise(probability_ape =sum(probability_ape), .groups ='drop') %>%pivot_wider(names_from ='habitat_preference', values_from ='probability_ape')pies <-nodepie(d_acr_phytools_sub, cols =2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size =2, shape =21, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +scale_fill_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)```We can see that lots of the shift changes are associated with nodes where the model is confident in the ancestral state (either freshwwater + terrestrial or marine mud). To me there is not clear evidence that the nodes where shifts are thought to occur are in areas where a transition is likely to have occurred.An alternative approach here is to not only plot each monomorphic state in the pie charts at the node, and split each polymorphic state into the probability of it either being marine mud, terrestrial, or freshwater. I do not think I favour this approach in the long-term, but it produces a graph that is slightly more consistent with our previous colour schemes.```{r plot_nodepies2}#| fig.height: 8#| fig.width: 9#| eval: false# change probabilities to just be marine, freshwater or terrestriald_acr_single <-filter(d_acr_phytools, habitat_preference %in%c('terrestrial', 'mud_and_shore', 'freshwater'))d_acr_double <-filter(d_acr_phytools, str_count(habitat_preference, ':') ==1) %>%separate_rows(habitat_preference, sep =':') %>%mutate(probability_phytools = probability_phytools/2)d_acr_general <-filter(d_acr_phytools, habitat_preference =='generalist') %>%group_by(node) %>%do(data.frame(habitat_preference =c('terrestrial', 'mud_and_shore', 'freshwater'), probability_phytools = .$probability_phytools/3)) %>%ungroup()# combine to have just three habitatsd_acr2 <-bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%group_by(node, habitat_preference) %>%summarise(probability_phytools =sum(probability_phytools), .groups ='drop') %>%filter(., node %in% d_tree_sub$label) %>%pivot_wider(names_from ='habitat_preference', values_from ='probability_phytools')d_acr2$node <- d_tree_sub$nodepies <-nodepie(d_acr2, cols =2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha =1)d_pie <- tibble::tibble(node=as.numeric(d_acr2$node), pies=pies)# plot node pies on the tree with transition positiontree_plot_sub <-ggtree(tree_sub, layout ='circular', branch.length ='none') %<+%filter(d_meta, tip_label %in% tree_sub$tip.label) +new_scale_color() +geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size =2, shape =21, position =position_jitter(width =1, height =0)) +scale_color_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +scale_fill_manual('Habitat preference', values = cols_hab, labels =c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col ='red', shape =21, fill =NA,size=8)# plot treep_acr <- tree_plot_sub %<+% d_pie +geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.png'), p_acr, height =10, width =10)ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.pdf'), p_acr, height =10, width =10)p_acr```## How to compare BAMM results to ancestral state reconstructionsA key question here is how to link the BAMM analysis to these ancestral state reconstructions. One idea I had was to compare all the nodes which have rate shifts to the rest of the nodes in terms of marginal probabilities of ancestral states.We can pretty easily do this by filtering the data on marginal state probabilities by whether they have been identified as a shift node or not.We probably need to control for time from tip in this analysis somehow. Obviously deeper tips are more likely to be less certain in the evolutionary model.This plot shows a density plot for the marginal probability of each node for each state (habitat preference) split by whether the node was a shift node or not.If there is no difference in ancestral state reconstruction between shift nodes and non-shift nodes, all the distributions should overlay on each other.```{r shift_node_explore}#| fig-height: 8#| fig-width: 10#| eval: false# check whether internal node has a shift in or notd_acr_phytools <-mutate(d_acr_phytools, shift =ifelse(node %in% shiftnodes, 'shift', 'noshift'))ggplot(d_acr_phytools, aes(probability_phytools)) +geom_density(aes(col = shift, fill = shift, probability_phytools, ..scaled..), alpha =0.1) +facet_wrap(~habitat_preference, scales ='free') +scale_fill_manual(values =c('black', 'red')) +scale_color_manual(values =c('black', 'red')) +theme_bw()```Most of the panels overlay each other pretty consistently. However a couple of them do not (i.e. terrestrial and freshwater + terrestrial). The problem is I do not really understand how to interpret this at the moment and think it is kind of meaningless. For example, the panel for freshwater + terrestrial says that shift nodes are predicted to contain more freshwater + terrestrial. But this is true even when the probability of being freshwater + terrestrial in that node is 0.25.Think how to cleverly look at these things is something I would really like to discuss further.## Useful links used- [Chapter 13](https://lukejharmon.github.io/pcm/chapter13_chardiv/#ref-Beaulieu2016-ww) of Luke Harmon's book on phylogenetic comparative methods on "Characters and diversification rates".- [Tutorial](https://www.zoology.ubc.ca/prog/diversitree/doc/diversitree-tutorial.pdf) on using the R package **diversitree**.- [Documentation](https://yulab-smu.top/treedata-book/) of the R package **ggtree** used for plotting phylogenies.- [Phylo-wiki](https://www.r-phylo.org/wiki/HowTo/Ancestral_State_Reconstruction) help page on ancestral state reconstruction.- [Github Issue](https://github.com/YuLab-SMU/ggtree/issues/419#issuecomment-877563385) on plotting pie charts for ancestral state reconstructions.